Loading Packages
Data Preparation
The data set contains 683 observations of cancer cells with 11 attributes. For the purpose of this analysis ID and Class, variables will not be used, the selected attributes are Clump Thickness, Cell Size, Cell Shape, Adhesion, Epithelial, Bare Nuclei, Chromatin, Nucleoli, Mitoses.
Loading sample data
data.first <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data", header = F)
names1 = c('ID','Clump.Thickness','Cell.Size','Cell.Shape','Adhesion','Epithelial',
'Bare.Nuclei','Chromatin','Nucleoli','Mitoses','Class')
names(data.first) = names1
data.first = subset(data.first, Bare.Nuclei!='?')
data.first$Bare.Nuclei = as.integer(data.first$Bare.Nuclei)
data <- data.first[,2:10]
str(data)
## 'data.frame': 683 obs. of 9 variables:
## $ Clump.Thickness: int 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.Size : int 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.Shape : int 1 4 1 8 1 10 1 2 1 1 ...
## $ Adhesion : int 1 5 1 1 3 8 1 1 1 1 ...
## $ Epithelial : int 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.Nuclei : int 2 3 4 6 2 3 3 2 2 2 ...
## $ Chromatin : int 3 3 3 3 3 9 3 3 1 2 ...
## $ Nucleoli : int 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
Standardization
In order to apply PCA first, we need to standardize our data set so that each variable has the same scale. This is done by subtracting every mean in each column. The following data is scaled.
scaled.data <- apply(data, 2, scale)
head(scaled.data)
## Clump.Thickness Cell.Size Cell.Shape Adhesion Epithelial Bare.Nuclei
## [1,] 0.1977598 -0.7016978 -0.7412304 -0.63889730 -0.5552016 -0.5655993
## [2,] 0.1977598 0.2770488 0.2625905 0.75747664 1.6939247 -0.1007325
## [3,] -0.5112687 -0.7016978 -0.7412304 -0.63889730 -0.5552016 0.3641343
## [4,] 0.5522740 1.5820442 1.6010185 -0.63889730 -0.1053763 1.2938679
## [5,] -0.1567545 -0.7016978 -0.7412304 0.05928967 -0.5552016 -0.5655993
## [6,] 1.2613024 2.2345419 2.2702324 1.80475710 1.6939247 -0.1007325
## Chromatin Nucleoli Mitoses
## [1,] -0.181694 -0.6124785 -0.3481446
## [2,] -0.181694 -0.2848960 -0.3481446
## [3,] -0.181694 -0.6124785 -0.3481446
## [4,] -0.181694 1.3530163 -0.3481446
## [5,] -0.181694 -0.6124785 -0.3481446
## [6,] 2.267589 1.3530163 -0.3481446
Creating a Covariance Matrix
This matrix will help us to understand the relationships between variables whether they are positively or negatively related or even no relation. The results show that almost all variables are positively correlated however, Chromatin is negatively correlated with Bare Nuclei and Mitoses, also Bare Nuclei and Adhesion have a negative relationship.
cov.data <- cov(scaled.data)
In order to get PCA as a next step, we calculated Eigenvectors and Eigenvalues.
eigen.data <- eigen(cov.data)
head(eigen.data)
## $values
## [1] 5.52692008 0.80235707 0.71804127 0.53138409 0.40352916 0.37046840 0.29654175
## [8] 0.26069582 0.09006237
##
## $vectors
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] -0.3108812 0.1143272940 -0.09151782 0.89047218 -0.17466975 -0.01029507
## [2,] -0.3950105 0.0003807321 -0.12160938 -0.01160560 0.16022440 -0.11835261
## [3,] -0.3901838 0.0310713058 -0.12907908 0.03756402 0.14592803 -0.07155935
## [4,] -0.3367502 -0.1977068657 -0.22381922 -0.32126941 -0.57943497 -0.36686652
## [5,] -0.3498809 -0.1254078557 0.05174948 -0.09522565 0.69935231 -0.34033711
## [6,] -0.2273349 0.7836138121 0.51715909 -0.16809635 -0.13457329 -0.13394910
## [7,] -0.3540839 0.0407963887 -0.30126024 -0.17817141 -0.20070546 0.13280062
## [8,] -0.3510963 0.0180151523 -0.01887128 -0.17039131 0.08456282 0.83211615
## [9,] -0.2427624 -0.5613457636 0.74083319 0.06478402 -0.17928073 0.04847562
## [,7] [,8] [,9]
## [1,] -0.11527159 0.21148911 -0.0183093590
## [2,] 0.08315009 -0.47720596 -0.7447982017
## [3,] 0.05212958 -0.60164979 0.6615157072
## [4,] -0.45580849 0.12483583 0.0228232540
## [5,] -0.12742415 0.47403665 0.0651122268
## [6,] -0.01552206 0.04758685 -0.0007392573
## [7,] 0.75996932 0.33609593 0.0449995260
## [8,] -0.37304914 0.08866639 -0.0205415132
## [9,] 0.19005992 -0.04672169 0.0113760501
Main Components
Since R calculated Eigenvectors in the negative direction, first we multiplied Eigenvalues with -1. And defined our data as loadings.data.
loadings.data <- -eigen.data$vectors
row.names(loadings.data) <- c("Clump.Thickness","Cell.Size","Cell.Shape","Adhesion","Epithelial","Bare.Nuclei","Chromatin","Nucleoli","Mitoses")
colnames(loadings.data) <- c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9")
loadings.data
## PC1 PC2 PC3 PC4 PC5
## Clump.Thickness 0.3108812 -0.1143272940 0.09151782 -0.89047218 0.17466975
## Cell.Size 0.3950105 -0.0003807321 0.12160938 0.01160560 -0.16022440
## Cell.Shape 0.3901838 -0.0310713058 0.12907908 -0.03756402 -0.14592803
## Adhesion 0.3367502 0.1977068657 0.22381922 0.32126941 0.57943497
## Epithelial 0.3498809 0.1254078557 -0.05174948 0.09522565 -0.69935231
## Bare.Nuclei 0.2273349 -0.7836138121 -0.51715909 0.16809635 0.13457329
## Chromatin 0.3540839 -0.0407963887 0.30126024 0.17817141 0.20070546
## Nucleoli 0.3510963 -0.0180151523 0.01887128 0.17039131 -0.08456282
## Mitoses 0.2427624 0.5613457636 -0.74083319 -0.06478402 0.17928073
## PC6 PC7 PC8 PC9
## Clump.Thickness 0.01029507 0.11527159 -0.21148911 0.0183093590
## Cell.Size 0.11835261 -0.08315009 0.47720596 0.7447982017
## Cell.Shape 0.07155935 -0.05212958 0.60164979 -0.6615157072
## Adhesion 0.36686652 0.45580849 -0.12483583 -0.0228232540
## Epithelial 0.34033711 0.12742415 -0.47403665 -0.0651122268
## Bare.Nuclei 0.13394910 0.01552206 -0.04758685 0.0007392573
## Chromatin -0.13280062 -0.75996932 -0.33609593 -0.0449995260
## Nucleoli -0.83211615 0.37304914 -0.08866639 0.0205415132
## Mitoses -0.04847562 -0.19005992 0.04672169 -0.0113760501
Looking at the principal component table we can say that PC1 is equally affected by each variable however Cell size and Cell shape have slightly more effect than others.
PC2 is negatively effected by Bare nuclei and positively effected by Mitoses. These results reveal that the second component decreases as Bare nuclei increase yet that it increases as Mitoses increases. These results are also supported by the covariance matrix.
By looking at the third column the most negatively affected attribute is Mitoses where Nucleoli is the least affected attribute.
The fourth principal component has the highest negatively effected value as Clump thickness(-0.89).
PC5 is negatively effected by attribute Epithelial where Adhesion is the positively affected variable.
Nucleoli is the most negatively correlated variable in PC6.
Chromatin is the most negatively correlated variable in PC7 where Adhesion has a positive effect on PC7.
Cell shape has the biggest positive effect on PC8.
Cell size has the biggest positive effect on PC9 where Bare nuclei have almost no effect on PC9.
The following heatmap shows how variables are related to Principal Components where the relationship increases as the color lighten. The heatmap confirms the assumptions stated below.
loadings.pc <- abs(loadings.data)
heatmap <- heatmap.2(loadings.pc,
dendrogram='none',
Rowv = FALSE,
Colv = FALSE,
trace = 'none',
main = "Principal Component Values",
margins = c(5,10),
cexRow = 1,
cexCol = 1)
Principal Component Scores
In order to form a linear combination with scaled data and loadings data, we calculated principal component scores for each observation.
PC1(pc1.ob1) and PC2(pc2.ob1) are calculated for the first observation.
scaled.data[1,]
## Clump.Thickness Cell.Size Cell.Shape Adhesion Epithelial
## 0.1977598 -0.7016978 -0.7412304 -0.6388973 -0.5552016
## Bare.Nuclei Chromatin Nucleoli Mitoses
## -0.5655993 -0.1816940 -0.6124785 -0.3481446
loadings.data[,1:2]
## PC1 PC2
## Clump.Thickness 0.3108812 -0.1143272940
## Cell.Size 0.3950105 -0.0003807321
## Cell.Shape 0.3901838 -0.0310713058
## Adhesion 0.3367502 0.1977068657
## Epithelial 0.3498809 0.1254078557
## Bare.Nuclei 0.2273349 -0.7836138121
## Chromatin 0.3540839 -0.0407963887
## Nucleoli 0.3510963 -0.0180151523
## Mitoses 0.2427624 0.5613457636
ob1 <- c(0.3108812,0.3950105,0.3901838,0.3367502,0.3498809,0.2273349,0.3540839,0.3510963,0.2427624)
pc_1 <- c(0.1977598,-0.7016978,-0.7412304,-0.6388973,-0.5552016,-0.5655993,-0.1816940,-0.6124785,-0.3481446)
pc1.ob1 <- sum(ob1*pc_1)
pc1.ob1
## [1] -1.406788
ob1 <- c(0.3108812,0.3950105,0.3901838,0.3367502,0.3498809,0.2273349,0.3540839,0.3510963,0.2427624)
pc_2 <- c(-0.1143272940,-0.0003807321,-0.0310713058,0.1977068657,0.1254078557,-0.7836138121,-0.0407963887,-0.0180151523,0.5613457636)
pc2.ob1 <- sum(ob1*pc_2)
pc2.ob1
## [1] -2.557137e-10
Observation one has PC1 score of about -1.406 and PC2 score of about 0.07. These values are the same as in the PC score matrix below.
PC.Matrix <- as.matrix(scaled.data) %*% loadings.data
colnames(PC.Matrix) <- c("PC1", "PC2", "PC3", "PC4", "PC5","PC6","PC7","PC8", "PC9")
The following data frame shows the PC scores for each observation.
PC <- as.data.frame(PC.Matrix)
head(PC)
## PC1 PC2 PC3 PC4 PC5 PC6
## 1 -1.4067883 0.07097607 0.2069485 -0.62378184 0.1500159 -0.08250219
## 2 0.8493507 0.22736674 0.4174616 0.14661590 -0.8822552 1.17259362
## 3 -1.4158510 -0.57651471 -0.3387604 0.16387308 0.1512874 0.03473530
## 4 1.7896182 -1.47930203 -0.1283745 -0.31063920 -0.7263317 -0.87430959
## 5 -1.2818855 0.24954308 0.3307718 -0.08379068 0.4926469 0.16998949
## 6 4.5315544 0.12027485 2.0151713 0.19880092 -0.3431870 0.25489537
## PC7 PC8 PC9
## 1 -0.275192219 -0.3536743 0.021202599
## 2 0.643588317 -0.5743148 -0.085114761
## 3 -0.342491672 -0.2479656 0.008908055
## 4 0.273086909 1.5943855 0.191645782
## 5 0.002181916 -0.3658571 -0.001223227
## 6 -0.274237450 0.2436223 -0.036264969
In order to decide which principal component will be used, we calculated the explained variance. The high score in explained variance shows the significance level and that should be selected as a principal component.
Based on the below results, the first component is the most significant.
PVE <- eigen.data$values / sum(eigen.data$values)
round(PVE, 2)
## [1] 0.61 0.09 0.08 0.06 0.04 0.04 0.03 0.03 0.01
In the following code, I will show how the first and second principal components are plotted. Color bar measures Cell size.
Plots
f <- list(family = "Courier New, monospace", size = 18, color = "black")
x1 <- list(title = "PC1", titlefont = list(size = 10), range = c(-5,5))
y1 <- list(title = "PC2", titlefont = list(size = 10), range = c(-5,5))
xx1 <- list(title = "PC3", titlefont = list(size = 10), range = c(-5,5))
a <- list(text = "PC 1 and 2",
font = f,
xref = "paper",
yref = "paper",
yanchor = "bottom",
xanchor = "center",
align = "center",
x = 0.5,
y = 1,
showarrow = FALSE)
b <- list(text = "PC 2 and 3",
font = f,
xref = "paper",
yref = "paper",
yanchor = "bottom",
xanchor = "center",
align = "center",
x = 0.5,
y = 1,
showarrow = FALSE)
c <- list(text = "PC 1 and 3",
font = f,
xref = "paper",
yref = "paper",
yanchor = "bottom",
xanchor = "center",
align = "center",
x = 0.5,
y = 1,
showarrow = FALSE)
pc1_2 <- plot_ly(PC,
x = ~PC1,
y = ~PC2,
type = 'scatter',
mode = 'markers',
color = data$Adhesion,
colors = "BuGn",
hoverinfo = 'text',
text = ~paste(data.first$ID)) %>%
layout(annotations = a,
xaxis = x1,
yaxis = y1,
showlegend = FALSE)
pc2_3 <- plot_ly(PC,
x = ~PC2,
y = ~PC3,
type = 'scatter',
mode = 'markers',
color = data$Adhesion,
colors = "BuPu",
hoverinfo = 'text',
text = ~paste(data.first$ID)) %>%
layout(annotations = b,
xaxis = y1,
yaxis = xx1,
showlegend = FALSE)
pc1_3 <- plot_ly(PC,
x = ~PC1,
y = ~PC3,
type = 'scatter',
mode = 'markers',
color = data$Adhesion,
colors = "OrRd",
hoverinfo = 'text',
text = ~paste(data.first$ID)) %>%
layout(annotations = c,
xaxis = x1,
yaxis = xx1,
showlegend = FALSE)
subplot(pc1_2, pc2_3, pc1_3, titleX = TRUE, titleY = TRUE)
PC1 and 2
In the first graph, PC1 and 2, the first principal component is effected by Cell size and Cell shape Because these values were positive, each increase of x (PC1) coincides with a increase in these attributes.
Second principal component is most effected by Bare nucluei and suggest that an increase in y(PC2) indicates a decrease in height.
The first two principal components were also effected considerably by Adhesion. This is verified by the colorings of the data points, as there is an evident relationship between the x and y values and the Adhesion value.
When we look at the specific observations such as ID 1320077 and ID 1103722 we observe them on the left side with a light color which means they both have low Adhesion value and low PC1 value which states low cell.size and cell.shape value.
The data set also confirms these values. And the rest of the graphics and data could interpret as the above-mentioned way.
3D These previous graphs are now combined into an interactive three-dimensional graph, shown below. As with the two-dimensional plots, the colorbar meaures Adhesion. Together, these three variables explain 78% of the total variance.